Optimizing Wrapper-Based Feature Selection for Use on Bioinformatics Data
Authors
Abstract
High dimensionality (having a large number of independent attributes) is a major problem for bioinformatics datasets such as gene microarray datasets. Feature selection algorithms are necessary to remove the irrelevant (not useful) and redundant (containing duplicate information) features. One approach to this problem is wrapper-based subset evaluation, which builds classification models on different feature subsets to discover which subset performs best. Although the computational complexity of this technique has led to it being rarely used for bioinformatics, its ability to find the features which give the best model makes it important in this domain. However, when using wrapper-based feature selection, it is not obvious whether the learner used within the wrapper should match the learner used for building the final classification model. Furthermore, the answer may depend on other properties of the dataset, such as difficulty of learning (general performance without feature selection) and dataset balance (ratio of minority to majority instances). To study this, we use nine datasets with varying levels of difficulty and balance. We find that across all datasets, the best strategy is to use one learner (Naïve Bayes) inside the wrapper regardless of the learner which will be used outside. However, when broken down by difficulty and balance levels, our results show that the more balanced and less difficult datasets work best when the learners inside and outside the wrapper match. Thus, the answer to this question depends on properties of the dataset.
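To make the distinction between the learner inside the wrapper and the learner outside it concrete, here is a minimal sketch in plain Python. It is not the authors' experimental setup: the greedy forward search, the 3-fold evaluation, the tiny Gaussian Naive Bayes, the 1-nearest-neighbour outer learner, the synthetic data, and every function name are assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, pvariance


def fit_gnb(X, y):
    """Tiny Gaussian Naive Bayes: per-class prior, feature means, variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = (
            len(rows) / len(X),
            [mean(c) for c in cols],
            [pvariance(c) + 1e-9 for c in cols],  # small constant avoids zero variance
        )
    return model


def predict_gnb(model, x):
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=log_posterior)


def fit_1nn(X, y):
    """1-nearest-neighbour 'model' is just the stored training set."""
    return list(zip(X, y))


def predict_1nn(model, x):
    return min(model, key=lambda pair: math.dist(pair[0], x))[1]


def cv_accuracy(X, y, features, fit, predict, folds=3):
    """k-fold accuracy of a learner restricted to the given feature indices."""
    Xs = [[row[j] for j in features] for row in X]
    correct = 0
    for k in range(folds):
        train = [i for i in range(len(Xs)) if i % folds != k]
        test = [i for i in range(len(Xs)) if i % folds == k]
        model = fit([Xs[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, Xs[i]) == y[i] for i in test)
    return correct / len(Xs)


def wrapper_forward_select(X, y, fit, predict, max_features):
    """Greedy forward search; each candidate subset is scored by the inner learner."""
    selected, best_score = [], 0.0
    remaining = set(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scored = [(cv_accuracy(X, y, selected + [j], fit, predict), j)
                  for j in sorted(remaining)]
        score, j = max(scored, key=lambda t: (t[0], -t[1]))  # ties go to lower index
        if score <= best_score:
            break  # no candidate improves the subset, so stop
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score


def make_data(n=60, n_features=6, seed=0):
    """Synthetic, balanced two-class data; only feature 0 carries class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 6.0 * label
        X.append(row)
        y.append(label)
    return X, y


X, y = make_data()
# Inside the wrapper: Naive Bayes scores the candidate subsets.
selected, inner_score = wrapper_forward_select(X, y, fit_gnb, predict_gnb, max_features=3)
# Outside the wrapper: a different learner is built on the chosen subset.
outer_score = cv_accuracy(X, y, selected, fit_1nn, predict_1nn)
print(selected, round(inner_score, 3), round(outer_score, 3))
```

Passing `fit_1nn, predict_1nn` to `wrapper_forward_select` instead would give the "matching learners" configuration. Note that scoring the outer learner on the same instances used for selection, as this sketch does for brevity, is optimistically biased; a proper comparison would hold out data that the wrapper never sees.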
Similar resources
Fuzzy-rough Information Gain Ratio Approach to Filter-wrapper Feature Selection
Feature selection for various applications has been carried out for many years in many different research areas. However, there is a trade-off between finding feature subsets with minimum length and increasing the classification accuracy. In this paper, a filter-wrapper feature selection approach based on fuzzy-rough gain ratio is proposed to tackle this problem. As a search strategy, a modifie...
Feature Selection Using Multi Objective Genetic Algorithm with Support Vector Machine
Different approaches have been proposed for feature selection to obtain a suitable feature subset from among all features. These methods search the feature space for feature subsets that satisfy some criteria or optimize several objective functions. The objective functions are divided into two main groups: filter and wrapper methods. In filter methods, feature subsets are selected due to some measu...
Fast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets
Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...
Developing a Filter-Wrapper Feature Selection Method and its Application in Dimension Reduction of Gen Expression
Nowadays, the growing volume of data and number of attributes in datasets has reduced the accuracy of learning algorithms and increased their computational complexity. Feature selection is a dimensionality reduction method that can be carried out through filtering or wrapping. Wrapper methods are more accurate than filter ones, but they run more slowly and carry a greater computational burden. With ...
Bridging the semantic gap for software effort estimation by hierarchical feature selection techniques
Software project management is one of the significant activities in the software development process. Software Development Effort Estimation (SDEE) is a challenging task in software project management. SDEE has been practiced in the computer industry since the 1940s and has been reviewed several times. An SDEE model is appropriate if it provides the accuracy and confidence simultaneously before softwa...